Statistical inference using weights and survey design

1. Survey Design and Statistical inference

Pierre Walthéry

UK Data Service

October 2025

Introductions

Housekeeping

  • Toilets/fire exits
  • Coffee / refreshments and lunch
  • Four 50-minute sessions
  • Don’t forget to get the data
  • Your feedback is very important!

Leave us your feedback

Plan of the day

11.00-11.15 Introductions

11.15-12.00 Session 1: Survey Design: a refresher

Coffee break

12.15-13.00 Session 2: Inference in theory and practice

Lunch break

13.45-14.30 Session 3: R and Stata examples (1)

Coffee break

14.45-15.30 Session 4: R and Stata examples (2)

About the UK Data Service

The UK Data Service in a nutshell

  • Main repository of UK secondary social science data

  • A provider of support, training and guidance

  • Freely accessible, funded by the ESRC

  • Who are we for?

    • Academic researchers and students
    • Government analysts
    • Charities and the voluntary sector
    • Business consultants
    • Independent research centres / think tanks

Data curated

  • Survey microdata:

    • Large-scale cross-sectional UK government surveys series
    • Major UK longitudinal surveys
    • Multinational survey data
  • Aggregate databases, e.g. OECD

  • Census data – modern and historic records

  • Business and administrative microdata

  • One-off surveys, multimedia and qualitative data deposited on Reshare

User support and training

  • Helpdesk for data-related queries
  • Webinars and online workshops:
    • datasets, methods, and software focused
  • Online learning materials: Data Skills Modules and pathways
  • Both ‘traditional’ survey data and new forms of data, e.g. computational social science
  • Supporting data literacy among undergraduate students

Good to know

What this workshop is…

  • Second of a series of introductory workshops on using weights and survey design information to compute accurate population estimates
  • Will be used to design asynchronous material
  • Practical, aimed at everyday users
  • Plain English: (almost) no formulas
  • New approach: community of practice

And isn’t

  • A full course on complex survey design/statistical inference
  • An introduction to R / Stata
  • An introduction to advanced topics of statistical inference
  • Quick, easy to use recipes for computing weighted estimates

tl;dr: some basic degree of familiarity with the topic is assumed

1. Survey design and statistical inference - a refresher

Basics of Survey Design

[Figure: stylised image of the circular relationship between inference and sampling]

Samples and sampling

  • Survey design: strategies used to collect samples.

  • Sample members can either be selected:

    • non-randomly (non-probability sampling), for example purposively, or when internet users self-select to take part in an online poll
    • randomly (all members of the target population have a known, non-zero chance of selection), also known as probability sampling.
  • The process of deriving population estimates from a sample is called inference
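
As an illustration of inference from a probability sample, here is a minimal Python sketch. The population of ages is made up, and the sample size of 500 is an arbitrary assumption; the point is only that a sample estimate comes with a measure of its own uncertainty.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical target population: ages of 10,000 people (made-up values).
population = [random.randint(16, 90) for _ in range(10_000)]

# Probability sample: every member has the same known chance of selection.
sample = random.sample(population, k=500)

# Inference: use the sample to estimate a population quantity.
estimate = statistics.mean(sample)

# Standard error of the mean under simple random sampling
# (ignoring the finite population correction for simplicity).
std_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"Sample estimate of mean age: {estimate:.1f} (SE {std_error:.2f})")
print(f"True population mean:        {statistics.mean(population):.1f}")
```

In real surveys the true population value is unknown, which is why the standard error matters.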

Probability samples and sampling

  • Random sampling minimises the risk of obtaining unrepresentative samples and biased population estimates:

    • i.e. when certain groups are under-represented or excluded.
  • Simple random sampling (SRS), i.e. drawing members of the target population directly, is considered the best way to conduct random sampling and avoid bias…

  • … but it is difficult to achieve: it requires a list of the whole population (e.g. a national register).

  • … and it is not necessarily optimal:

    • when some groups are more or less likely than others to take part in surveys
    • when smaller groups need to be over-represented to obtain more reliable and precise statistics
    • when fieldwork is costly, especially in low population density areas

The survey design problem

  • Designing surveys entails striking a balance between:

    • representativeness and precision (ie sample size)…
    • while keeping costs down (surveys are expensive!).
  • Large-scale social surveys tend to rely on techniques that strike this balance better than SRS, for example:

    • to ensure correct representation of each country of the UK (i.e. by drawing separate samples for each of them)
    • to improve the precision of estimates for certain groups, for example ethnic minorities, which need an adequate sample size.
  • In effect, SRS is more of a theoretical possibility than a real-world sampling technique for social surveys

    • ‘Complex sample design’ is the label given to everything other than SRS, which is perhaps not the best label!

Common ‘complex’ sampling techniques

  • Clustering

    • Consists of dividing the population into groups that are as internally heterogeneous as possible, i.e. ‘mini populations’, some of which are then randomly drawn while the others are left out.
    • Usually goes together with multistage sampling, i.e. drawing sample units in several steps rather than all at once
  • Stratifying also consists of dividing the population into groups according to predetermined characteristics, but this time units are drawn from all of them

  • Adjusting the sampling proportion, i.e. sampling some groups at a higher or lower rate than the rest of the population

The UK context

  • No sampling frame, i.e. no national register of the population

  • The closest thing to it (in Great Britain) is Royal Mail’s Postcode Address File (PAF), i.e. a structured list of addresses

  • For Northern Ireland, the most commonly used list is that of the Land and Property Services Agency (LPSA).

  • We cannot use the PAF to directly draw samples of households or individuals, as their number at each address is not known.

  • However, the structure of the PAF easily enables geographical clustering of surveys: addresses, or ‘delivery points’, cluster into larger units

  • Also, addresses receiving unusually large amounts of mail, which are likely to be businesses or institutions, can be filtered out

Example

  • The postcode M13 9PL is embedded within the M13 ‘postcode district’ and the M13 9 ‘postcode sector’.
  • Survey designs often use either postcode sectors or districts as Primary Sampling Units (PSUs) to reduce fieldwork costs and time.
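
The nesting of postcodes can be sketched in a few lines of Python. The helper name `postcode_units` is hypothetical, and the sketch assumes the postcode is written with a space between its outward and inward parts.

```python
def postcode_units(postcode: str) -> dict:
    """Split a UK postcode into the nested units described above.

    Assumes the outward and inward parts are separated by whitespace.
    """
    outward, inward = postcode.strip().upper().split()
    return {
        "district": outward,                 # outward code
        "sector": f"{outward} {inward[0]}",  # outward code + first inward digit
        "postcode": f"{outward} {inward}",   # full postcode
    }


units = postcode_units("M13 9PL")
print(units)
```

A design using postcode sectors as PSUs would group addresses by the `sector` value before drawing the first-stage sample.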

Example

The previous figure illustrates clustering with four districts:

  • The higher level clusters, i.e. those at which the first random draw happened, are the Primary Sampling Units (PSUs).

  • Districts 1 and 4 have been selected to be in the sample.

  • A second stage of sampling follows: addresses are sampled from within the two selected districts

  • This can be followed by further draws of either:

    • further clusters, for example households, or
    • final individual sample members
  • In large scale surveys the PSUs are often geographical areas.
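
The two-stage draw described above can be sketched as follows. The frame is entirely made up: four districts with 20 addresses each, and five addresses drawn per selected district are assumptions chosen for illustration only.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical frame: four districts (the PSUs), each a list of address IDs.
frame = {
    f"District {d}": [f"D{d}-address-{i}" for i in range(1, 21)]
    for d in range(1, 5)
}

# Stage 1: randomly draw two of the four PSUs.
selected_psus = random.sample(sorted(frame), k=2)

# Stage 2: randomly draw five addresses within each selected PSU.
sample = {psu: random.sample(frame[psu], k=5) for psu in selected_psus}

for psu, addresses in sample.items():
    print(psu, addresses)
```

Addresses in the districts that were not drawn at stage 1 have no chance of entering the sample, which is what keeps fieldwork geographically concentrated.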

Household level clustering - 1

  • Arises in some large-scale household surveys such as the Labour Force Survey.

  • Imagine:

    • Estimating the proportion of individuals born abroad…
    • … from a population of 100 people in 50 households.
    • … if sampling one in ten households
  • Those born abroad are more likely to live together, ‘clustered’ within households, than to be spread randomly.

  • Some households are wholly overseas born, some mixed and most wholly UK born.

For example:

  • Household 1: 1 UK born
  • Household 2: 3 UK born
  • Household 3: 2 Overseas born
  • Household 4: 6 UK born
  • Household 5: 1 Overseas born, 1 UK born
  • Household 6: 2 UK born
  • Household 7: 1 UK born
  • Household 8: 1 UK born
  • Household 9: 5 Overseas born
  • Household 10: 3 UK born

Household level clustering - 2

And so on…

  • Clustering within households means that if we draw one in ten of the households for our sample, we can expect the sample to be less accurate in estimating the proportion of our population born outside the UK than if we had sampled individuals at random.

  • Clustering comes at the cost of making the sample coarser: we effectively draw from a smaller number of units (households rather than individuals), which reduces the diversity of the sample and in turn makes the estimates drawn from it less precise.
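
The loss of precision can be checked with a toy calculation. The Python sketch below treats the ten households listed earlier as the whole population (an assumption, since the slide's population has 50 households) and draws two whole households rather than one in ten, so that the comparison with a simple random sample of about the same number of individuals is meaningful.

```python
from itertools import combinations
from statistics import pvariance

# (household size, number born overseas) for the ten households listed earlier.
households = [(1, 0), (3, 0), (2, 2), (6, 0), (2, 1),
              (2, 0), (1, 0), (1, 0), (5, 5), (3, 0)]

N = sum(size for size, _ in households)                # people in the toy population
p = sum(overseas for _, overseas in households) / N    # true proportion born overseas

# Cluster design: draw 2 whole households; enumerate every possible sample
# and compute the estimated proportion born overseas in each.
cluster_estimates = [
    sum(o for _, o in pair) / sum(s for s, _ in pair)
    for pair in combinations(households, 2)
]
cluster_var = pvariance(cluster_estimates)

# Comparison: simple random sample (without replacement) of individuals,
# with a sample size equal to the average size of the cluster samples.
n = round(2 * N / len(households))
srs_var = p * (1 - p) / n * (N - n) / (N - 1)

print(f"True proportion born overseas: {p:.3f}")
print(f"Variance, 2 whole households:  {cluster_var:.3f}")
print(f"Variance, SRS of {n} people:     {srs_var:.3f}")
```

In this toy population the household-based estimates vary more from sample to sample than those from a simple random sample of individuals, which is the loss of precision described above.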

Stratification

  • This time the population is divided into groups according to some characteristics, and a sample of units is selected from each stratum.
  • Example: grouping addresses into three ‘bands’ of area deprivation
  • Stratified sampling ensures that the sample includes a certain proportion of units from the selected groups that might otherwise have been missed.
  • Unlike clusters, strata are meant to be as internally homogeneous as possible
  • Stratification tends to increase the precision of estimates, by increasing the number of units from potentially less represented or harder-to-reach groups.

An example of stratified sampling

  • The population is divided into four strata: North, South, East and West.
  • Within each stratum, five sampling units (i.e. addresses) are selected.
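
A minimal Python sketch of this stratified draw, assuming a made-up frame of 50 addresses per region (the frame size and address labels are hypothetical):

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

regions = ["North", "South", "East", "West"]

# Hypothetical frame: (stratum, address ID) pairs, 50 addresses per region.
frame = [(region, f"{region}-{i}") for region in regions for i in range(1, 51)]

# Group the frame by stratum.
strata = {}
for region, address in frame:
    strata.setdefault(region, []).append(address)

# Draw five addresses from every stratum: no stratum is left out.
sample = {
    region: random.sample(addresses, k=5)
    for region, addresses in strata.items()
}

for region, addresses in sample.items():
    print(region, addresses)
```

Unlike clustering, every group contributes units to the sample, so none of the four regions can be missed by chance.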

Stratification

  • UK surveys are usually stratified:

    • geographically (e.g. by Government Office Region);
    • by socio-economic characteristics (e.g. occupations);
    • or by demography (e.g. the proportion of people in an area who are pensioners).
  • Such information is usually obtained from (area-level) Census data.

Sampling fraction

  • In simple random sampling, each element in the sampling frame has an equal probability of selection.

  • In stratified sampling, proportionate stratification is when the same sampling fraction is used across all strata:

    • The number of units sampled is proportional to the size of a stratum
  • Disproportionate stratification means that the sampling fraction varies across strata.

    • This is useful when a group of interest is small, e.g. less populated areas or ethnic minority groups.
  • Disproportionate sampling results in some groups being over-represented in the sample:

    • Adjustments are needed before we can analyse the data.
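
One common adjustment is to weight each unit by the inverse of its sampling fraction. The Python sketch below uses two hypothetical strata with made-up sizes and sample means; it is an illustration of the idea, not the weighting scheme of any particular survey.

```python
# Hypothetical strata: population size (N), sample size (n) and the sample
# mean of some outcome. All numbers are made up for illustration.
strata = {
    "urban": {"N": 9000, "n": 300, "mean": 20.0},
    "rural": {"N": 1000, "n": 200, "mean": 30.0},  # deliberately over-sampled
}

for s in strata.values():
    s["fraction"] = s["n"] / s["N"]  # sampling fraction
    s["weight"] = s["N"] / s["n"]    # design weight = 1 / sampling fraction

# Unweighted mean: treats every respondent the same, so the over-sampled
# rural stratum pulls the estimate towards its own mean.
total_n = sum(s["n"] for s in strata.values())
unweighted = sum(s["n"] * s["mean"] for s in strata.values()) / total_n

# Weighted mean: each respondent counts for the number of people they represent.
total_weight = sum(s["n"] * s["weight"] for s in strata.values())
weighted = sum(s["n"] * s["weight"] * s["mean"] for s in strata.values()) / total_weight

print(f"Unweighted mean: {unweighted:.1f}")  # 24.0
print(f"Weighted mean:   {weighted:.1f}")    # 21.0
```

With these numbers the rural stratum makes up 40% of the sample but only 10% of the population, which is why the unweighted figure is too high.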

One slide summary

  • There is no sampling frame for the UK population, i.e. no list of all UK residents to pick from
  • Even if there were one, some people are known to be less easy to reach and/or less likely to take part in surveys than others.
  • Most UK social surveys rely on multi-stage clustering and stratification, alongside sampling proportionate to size
  • These strike a compromise between tackling non-response, dealing with unequal probabilities of selection and improving the representation of hard-to-reach groups, while keeping fieldwork costs down.

References